We will now shift gears and reformulate our question: we turn to textual data (a restaurant's name and its street) as features for predicting whether it has scored an A at the health inspection. This should make for a more interesting analysis than the rather poor one we conducted on the categorical area-of-town variable.
The outline of the procedure we are going to follow is:
- clean the restaurant names (and, later, the street names extracted from the addresses);
- turn the cleaned text into Bag of Words feature vectors;
- fit a Multinomial Naive Bayes classifier and evaluate it on a train/test split;
- wrap the vectorizer and classifier in a pipeline and cross-validate it;
- inspect the most informative features, the classification report and the confusion matrix.
In [1]:
import warnings
warnings.filterwarnings('ignore')
In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(rc={"axes.labelsize": 15});
# Some nice default configuration for plots
plt.rcParams['figure.figsize'] = 10, 7.5;
plt.rcParams['axes.grid'] = True;
plt.gray();
In [4]:
# Read the dataset into a DataFrame using pandas
df = pd.read_csv("../data/data.csv")
# Print the first observations
df.head()
Out[4]:
Let us start our manipulation of restaurant names:
In [5]:
Names = pd.Series(df['Restaurant_Name'].values)
We will remove all words that are three characters long or shorter:
In [6]:
import re
# Matches any run of 1-3 word characters, together with the non-word characters preceding it
shortword = re.compile(r'\W*\b\w{1,3}\b')
In [7]:
for i in range(len(Names)):
    Names[i] = shortword.sub('', Names[i])
In [8]:
# As an example, "JR's Tacos" is now just " Tacos"
Names[3]
Out[8]:
In [9]:
# Add a new column into our DataFrame:
df['Names'] = Names
In [10]:
df['Names'].head(10)
Out[10]:
In [11]:
df.columns
Out[11]:
Our first collection of feature vectors will come from the new "Names" column. We are still interested in whether a restaurant falls under the "pristine" category (Grade A, score greater than 90) or not, but here we will go a step further and try to predict a restaurant's letter grade (A, B, C or F) directly.
In [12]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import cross_validation
from sklearn.naive_bayes import MultinomialNB
# Turn the text documents into Bag of Words feature vectors
# (min_df=1 keeps every term; common English stop words are removed)
vectorizer = CountVectorizer(min_df=1, stop_words="english")
X = vectorizer.fit_transform(df['Names'])
y = df['Letter_Grade']
# Train/test split for evaluation
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, train_size=0.8)
# Fit a classifier on the training set
classifier = MultinomialNB().fit(X_train, y_train)
print("Training score: {0:.1f}%".format(
    classifier.score(X_train, y_train) * 100))
# Evaluate the classifier on the testing set
print("Testing score: {0:.1f}%".format(
    classifier.score(X_test, y_test) * 100))
It seems our Multinomial Naive Bayes classifier does significantly better at predicting a restaurant's status (whether it has earned a "pristine" score or not) from the restaurant's name than what we have seen so far with the area-of-town division.
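For the binary "pristine or not" view mentioned above, a minimal sketch on the same bag-of-words features might look as follows. This cell is not part of the original analysis; it assumes X and y from the previous cell are still in scope and simply treats grade "A" as the pristine class.
In [ ]:
# Sketch: binary "pristine or not" formulation on the same Names features.
# Assumes X and y from the cell above; grade 'A' is taken to mark "pristine".
y_pristine = (np.asarray(y) == 'A').astype(int)
Xp_train, Xp_test, yp_train, yp_test = cross_validation.train_test_split(X, y_pristine, train_size=0.8)
pristine_clf = MultinomialNB().fit(Xp_train, yp_train)
print("Binary testing score: {0:.1f}%".format(
    pristine_clf.score(Xp_test, yp_test) * 100))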
In [13]:
# Some information about our Bag of Words feature vectors:
In [14]:
# Number of non-zero entries stored in the sparse training matrix
len(X_train.data)
Out[14]:
In [15]:
n_samples, n_features = X_train.shape
In [16]:
n_samples
Out[16]:
In [17]:
n_features
Out[17]:
In [18]:
# The vocabulary of our vectorizer, i.e. the unique words comprising it:
len(vectorizer.vocabulary_)
Out[18]:
In [19]:
# A sample of ten feature names from the middle of the vocabulary
vectorizer.get_feature_names()[n_features // 3:n_features // 3 + 10]
Out[19]:
In [20]:
target_predicted_proba = classifier.predict_proba(X_test)
# The columns of predict_proba follow the classifier's own class ordering
percentages = pd.DataFrame(target_predicted_proba, columns=classifier.classes_)
In [21]:
# A table of probabilities for each one of the 3223 restaurants in the test set to be assigned a certain letter grade:
percentages.head()
Out[21]:
In [22]:
len(percentages)
Out[22]:
By default the decision threshold is 0.5: if we vary it from 0 to 1 we can generate a family of binary classifiers (for the "pristine or not" framing) that cover all the possible trade-offs between false positive and false negative prediction errors.
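As an illustration (a minimal sketch, not part of the original analysis), the cell below sweeps such a threshold for the binary "pristine or not" framing: it takes the probability the classifier assigns to class "A" and counts false positives and false negatives at a few cut-off values, assuming target_predicted_proba, classifier and y_test from the cells above are still in scope.
In [ ]:
# Sketch: decision-threshold sweep for the binary "pristine or not" framing.
# Assumes target_predicted_proba, classifier and y_test from the cells above.
a_index = list(classifier.classes_).index('A')
proba_a = target_predicted_proba[:, a_index]  # P(grade == 'A') for each test restaurant
is_pristine = np.asarray(y_test) == 'A'
for threshold in np.linspace(0.1, 0.9, 5):
    predicted_pristine = proba_a >= threshold
    false_pos = np.sum(predicted_pristine & ~is_pristine)
    false_neg = np.sum(~predicted_pristine & is_pristine)
    print("threshold={0:.1f}  false positives={1}  false negatives={2}".format(
        threshold, false_pos, false_neg))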
Let us use a pipeline in order to perform 10-fold cross-validation:
In [56]:
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('vec', CountVectorizer(max_df=0.8, ngram_range=(1, 2))),
    ('clf', MultinomialNB(alpha=0.01)),
])
_ = pipeline.fit(df['Names'], df['Letter_Grade'])
In [57]:
from sklearn.cross_validation import cross_val_score
from scipy.stats import sem
scores = cross_val_score(pipeline, df['Names'],
df['Letter_Grade'], cv=10)
scores.mean(), sem(scores)
Out[57]:
In [58]:
vec_name, vec = pipeline.steps[0]
clf_name, clf = pipeline.steps[1]
feature_names = vec.get_feature_names()
target_names = clf.classes_  # same ordering as the rows of clf.coef_
feature_weights = clf.coef_
feature_weights.shape
Out[58]:
In [59]:
len(feature_names)
Out[59]:
In [60]:
def print_top10(vectorizer, clf, class_labels):
    """Prints features with the highest coefficient values, per class"""
    feature_names = vectorizer.get_feature_names()
    for i, class_label in enumerate(class_labels):
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label,
                          " ".join(feature_names[j] for j in top10)))
In [61]:
print_top10(vectorizer, classifier, target_names)
In [62]:
from sklearn.metrics import classification_report
predicted = pipeline.predict(df['Names'])
In [63]:
print(classification_report(df['Letter_Grade'], predicted,
      target_names=target_names))
In [64]:
from sklearn.metrics import confusion_matrix
pd.DataFrame(confusion_matrix(df['Letter_Grade'], predicted),
index = pd.MultiIndex.from_product([['actual'], target_names]),
columns = pd.MultiIndex.from_product([['predicted'], target_names]))
Out[64]:
In [65]:
df.head(3)
Out[65]:
Let us now follow a similar approach in order to isolate the street name from the address string:
In [68]:
# Work with the raw address strings as a plain list
streets = df['Geocode'].tolist()
In [69]:
# Drop the street number (the first token of the address)
split_streets = [i.split(' ', 1)[1] for i in streets]
In [70]:
split_streets[0]
Out[70]:
In [71]:
# Drop the next token of the address as well
split_streets = [i.split(' ', 1)[1] for i in split_streets]
In [72]:
split_streets[0]
Out[72]:
In [73]:
# Keep only the first remaining word, i.e. the street name
split_streets = [i.split(' ', 1)[0] for i in split_streets]
In [74]:
split_streets[0]
Out[74]:
In [75]:
# Remove short words from the street names as well
for i in range(len(split_streets)):
    split_streets[i] = shortword.sub('', split_streets[i])
In [76]:
split_streets[0]
Out[76]:
In [77]:
# Create a new column with the street:
df['Street_Words'] = split_streets
In [78]:
# Turn the text documents into Bag of Words feature vectors
# (min_df=1 keeps every term; a TfidfVectorizer with min_df=2 could be used
#  instead to discard rare terms and limit overfitting)
#vectorizer = TfidfVectorizer(min_df=2)
vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(df['Street_Words'])
y = df['Letter_Grade']
# Train/test split for evaluation
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, train_size=0.8)
# Fit a classifier on the training set
classifier = MultinomialNB().fit(X_train, y_train)
print("Training score: {0:.1f}%".format(
    classifier.score(X_train, y_train) * 100))
# Evaluate the classifier on the testing set
print("Testing score: {0:.1f}%".format(
    classifier.score(X_test, y_test) * 100))
In [79]:
n_samples, n_features = X_train.shape
In [80]:
vectorizer.get_feature_names()[n_features // 3:n_features // 3 + 10]
Out[80]:
In [81]:
len(vectorizer.vocabulary_)
Out[81]:
In [82]:
target_predicted_proba = classifier.predict_proba(X_test)
# The columns of predict_proba follow the classifier's own class ordering
pd.DataFrame(target_predicted_proba[:10], columns=classifier.classes_)
Out[82]:
In [85]:
pipeline = Pipeline([
    ('vec', CountVectorizer(max_df=0.8, ngram_range=(1, 2))),
    ('clf', MultinomialNB(alpha=0.01)),
])
_ = pipeline.fit(df['Street_Words'], df['Letter_Grade'])
In [86]:
scores = cross_val_score(pipeline, df['Street_Words'],
df['Letter_Grade'], cv=3)
scores.mean(), sem(scores)
Out[86]:
In [87]:
vec_name, vec = pipeline.steps[0]
clf_name, clf = pipeline.steps[1]
feature_names = vec.get_feature_names()
target_names = clf.classes_  # same ordering as the rows of clf.coef_
feature_weights = clf.coef_
feature_weights.shape
Out[87]:
In [88]:
predicted = pipeline.predict(df['Street_Words'])
In [89]:
print(classification_report(df['Letter_Grade'], predicted,
      target_names=target_names))
In [90]:
pd.DataFrame(confusion_matrix(df['Letter_Grade'], predicted),
index = pd.MultiIndex.from_product([['actual'], target_names]),
columns = pd.MultiIndex.from_product([['predicted'], target_names]))
Out[90]:
In [91]:
print_top10(vectorizer, classifier, target_names)